Skip to content

fix: ensure genes_gnomad_convert_v4 picks mane_select and canonical transcripts #101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

tedil
Copy link
Contributor

@tedil tedil commented Mar 8, 2025

The previous code relied on sort order and consistent cli tool behaviour; I suspect it may have been enough to simply remove the reverse flag from the "3,3r" sort parameter. However, to be on the safe side, we re-implemented the logic using polars. Should be easier to read and more stable now.
(This probably also needs to be done for genes_gnomad_convert_v2_1_1)

Summary by CodeRabbit

  • Chores
    • Added a new dependency requirement for a high-performance data processing library and removed an obsolete environment configuration.
  • Refactor
    • Updated the gene constraints data conversion process by moving to a dedicated script, improving data consistency with explicit type conversion and clearer handling of missing values.

@tedil tedil requested a review from holtgrewe March 8, 2025 10:26
Copy link
Contributor

coderabbitai bot commented Mar 8, 2025

Walkthrough

The changes update environment configurations and data processing for gnomAD gene constraints. A new polars dependency is introduced in the environment file, while a previously maintained environment file for polars is removed. In the gnomAD rules, an inline Python function is removed and replaced with a script call to handle data conversion. The referenced script is enhanced with additional operations for DataFrame processing, such as casting non-string columns to strings and filling null values.

Changes

File(s) Change Summary
environment.yml and rules/.../envs/polars.yaml Added new polars dependency (polars >=1.23,<2) in environment.yml with usage context, and removed the old polars.yaml environment file that specified polars 1.24.0 and the conda-forge channel.
rules/.../gnomad.smk Removed the inline function run_genes_gnomad_constraints_v4_to_tsv and updated the genes_gnomad_convert_v4 rule to call the external script "scripts/gnomad_constraints_v4_to_tsv.py".
rules/.../scripts/gnomad_constraints_v4_to_tsv.py Enhanced the data processing in the main function by adding operations to cast non-string columns to strings and to fill null values with "NA" during DataFrame transformation.

Sequence Diagram(s)

sequenceDiagram
    participant SNK as Snakemake Rule
    participant PY as Python Script
    participant DF as DataFrame
    SNK->>PY: Execute "gnomad_constraints_v4_to_tsv.py"
    PY->>DF: Join with ensembl_xlink & perform transformations
    DF-->>PY: Return processed data (cast columns, fill nulls)
    PY->>SNK: Output TSV file
Loading

Poem

I'm a rabbit full of cheer,
Hopping through code with a happy ear.
New deps and scripts in a well-laid row,
Casting and filling—data in flow.
With whiskers twitching at each new line,
I leap into changes, feeling just fine!
🐇✨

Warning

There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure.

🔧 YAMLlint (1.35.1)
rules/work/genes/envs/polars.yaml

[Errno 2] No such file or directory: 'rules/work/genes/envs/polars.yaml'

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@tedil tedil self-assigned this Mar 8, 2025
@tedil tedil requested a review from stolpeo March 8, 2025 10:27
Copy link

github-actions bot commented Mar 8, 2025

Please check for formatting and linting issues. You can use make lint to:

  • Check your Bash style with beautysh and formatting with shellcheck.
  • Check your Python style with black, linting with flake8.
  • Check your Snakemake style with snakefmt.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (4)

49-52: Use idiomatic Python boolean comparisons

Avoid explicit comparisons to boolean literals.

-discard = df.filter(pl.col("mane_or_canonical") == False).select(
+discard = df.filter(~pl.col("mane_or_canonical")).select(
    ["gene", "gene_id"]
)
🧰 Tools
🪛 Ruff (0.8.2)

49-49: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


52-52: Consider using a logging mechanism instead of print

For better integration with the Snakemake workflow, consider using a proper logging mechanism instead of print statements.

-print("No mane_select or canonical transcript found for ", discard.select("gene"))
+import logging
+logging.info(f"No mane_select or canonical transcript found for {discard.select('gene')}")

54-55: Use idiomatic Python boolean comparisons

Avoid explicit comparisons to boolean literals.

-df = (
-    df.filter(pl.col("mane_or_canonical") == True)
+df = (
+    df.filter(pl.col("mane_or_canonical"))
🧰 Tools
🪛 Ruff (0.8.2)

54-54: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


67-71: Handle undefined snakemake variable for better IDE integration

The snakemake variable is undefined in this file and only available when the script is run through Snakemake. Consider adding a conditional import or check to make the script more IDE-friendly and to address static analysis warnings.

 if __name__ == "__main__":
+    try:
+        # When run by snakemake
         main(
             snakemake.input.tsv,
             snakemake.input.xlink_ensembl,
             snakemake.output.tsv,
             snakemake.output.tsv_md5,
         )
+    except NameError:
+        # For development/testing outside of snakemake
+        import argparse
+        parser = argparse.ArgumentParser()
+        parser.add_argument("--constraint-tsv", required=True)
+        parser.add_argument("--ensembl-xlink-tsv", required=True)
+        parser.add_argument("--output-tsv", required=True)
+        parser.add_argument("--output-tsv-md5", required=True)
+        args = parser.parse_args()
+        main(
+            args.constraint_tsv,
+            args.ensembl_xlink_tsv,
+            args.output_tsv,
+            args.output_tsv_md5
+        )
🧰 Tools
🪛 Ruff (0.8.2)

67-67: Undefined name snakemake

(F821)


68-68: Undefined name snakemake

(F821)


69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7557523 and 0bc189c.

📒 Files selected for processing (3)
  • environment.yml (1 hunks)
  • rules/work/genes/gnomad.smk (1 hunks)
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py

49-49: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


54-54: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


62-62: Undefined name shell

(F821)


67-67: Undefined name snakemake

(F821)


68-68: Undefined name snakemake

(F821)


69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)

🔇 Additional comments (3)
environment.yml (1)

56-57: LGTM - appropriate dependency added

The addition of Polars with version constraints (>=1.23,<2) is correctly formatted and includes a clear explanatory comment detailing its purpose.

rules/work/genes/gnomad.smk (1)

109-110: Good refactoring approach

Moving the complex processing logic from an inline function to a dedicated script improves maintainability and readability, which aligns well with the PR objectives.

rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)

41-48:

❓ Verification inconclusive

Verify sorting logic for transcript selection

The current sorting and grouping logic selects the last record for each gene after sorting. Ensure that this approach correctly prioritizes MANE select and canonical transcripts as intended. Consider whether the sort order ["gene", "mane_select", "canonical", ...] gives the right precedence.


🏁 Script executed:

#!/bin/bash
# Check the data structure and values for mane_select and canonical fields
head -n 10 $(find work/download/genes/gnomad -name "*.constraint_metrics.tsv" | head -n 1) | grep -E "mane_select|canonical" -A 2

Length of output: 194


Action Required: Manually Verify Sorting Priority for Transcript Selection

  • The current logic sorts records by ["gene", "mane_select", "canonical", "cds_length", "transcript"] and then selects the last record per gene. This approach should, in theory, pick transcripts where mane_select or canonical are flagged (if booleans sort as expected).
  • The automated check did not yield output because the expected directory (work/download/genes/gnomad) was not found. Please verify that the file paths are correct or run a manual check to ensure that the mane_select and canonical fields contain the expected values and that the sorting indeed prioritizes the desired transcripts.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (5)

36-39: Consider adding error handling for file operations

The file reading operations don't have error handling. If one of the input files doesn't exist or has an unexpected format, the script will crash without a helpful error message.

-df = pl.read_csv(constraint_tsv_path, separator="\t", null_values="NA")
-ensembl_xlink = pl.read_csv(
-    ensembl_xlink_tsv_path, separator="\t", null_values="NA"
-)
+try:
+    df = pl.read_csv(constraint_tsv_path, separator="\t", null_values="NA")
+    ensembl_xlink = pl.read_csv(
+        ensembl_xlink_tsv_path, separator="\t", null_values="NA"
+    )
+except Exception as e:
+    print(f"Error reading input files: {e}")
+    raise

49-51: Use Pythonic boolean comparison

Using == False is less Pythonic than using ~ (not) operator or not with Polars expressions.

-discard = df.filter(pl.col("mane_or_canonical") == False).select(
+discard = df.filter(~pl.col("mane_or_canonical")).select(
    ["gene", "gene_id"]
)
🧰 Tools
🪛 Ruff (0.8.2)

49-49: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


53-55: Use Pythonic boolean comparison

Similar to the previous comment, use the direct boolean expression instead of == True.

-df = (
-    df.filter(pl.col("mane_or_canonical") == True)
+df = (
+    df.filter(pl.col("mane_or_canonical"))
🧰 Tools
🪛 Ruff (0.8.2)

54-54: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


65-71: Add error handling and documentation for Snakemake dependency

The script assumes the snakemake object is available, which is true when run through Snakemake but might cause issues if someone tries to run the script directly. Consider adding a comment or try-except block to handle this dependency.

if __name__ == "__main__":
+    # This script is designed to be run by Snakemake
+    # The snakemake object is automatically available in the namespace
+    try:
        main(
            snakemake.input.tsv,
            snakemake.input.xlink_ensembl,
            snakemake.output.tsv,
            snakemake.output.tsv_md5,
        )
+    except NameError as e:
+        print("This script is designed to be run by Snakemake")
+        print(f"Error: {e}")
+        import sys
+        sys.exit(1)
🧰 Tools
🪛 Ruff (0.8.2)

67-67: Undefined name snakemake

(F821)


68-68: Undefined name snakemake

(F821)


69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)


3-72: Add docstrings for better documentation

The code would benefit from docstrings to explain the purpose of the script and the main function, including its parameters and what it does.

+"""
+Script to convert gnomAD constraints data to TSV format.
+
+This script processes gnomAD constraint data, prioritizing mane_select and 
+canonical transcripts, and converts it to a standardized TSV format.
+"""
import polars as pl

def main(
    constraint_tsv_path, ensembl_xlink_tsv_path, output_tsv_path, output_tsv_md5_path
):
+    """
+    Process gnomAD constraint data and convert it to TSV format.
+    
+    Args:
+        constraint_tsv_path: Path to the input constraint TSV file
+        ensembl_xlink_tsv_path: Path to the Ensembl cross-reference TSV file
+        output_tsv_path: Path to write the output TSV file
+        output_tsv_md5_path: Path to write the MD5 checksum of the output TSV file
+    
+    The function prioritizes mane_select and canonical transcripts when
+    selecting entries for each gene.
+    """
🧰 Tools
🪛 Ruff (0.8.2)

49-49: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


54-54: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


62-62: Undefined name snakemake

(F821)


67-67: Undefined name snakemake

(F821)


68-68: Undefined name snakemake

(F821)


69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ff7634 and a5fc1e3.

📒 Files selected for processing (2)
  • environment.yml (1 hunks)
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • environment.yml
🧰 Additional context used
🧠 Learnings (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
Learnt from: tedil
PR: varfish-org/varfish-db-downloader#101
File: rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py:62-62
Timestamp: 2025-03-10T08:16:40.589Z
Learning: When writing Python scripts for Snakemake (used with the `script:` directive), the `snakemake` object is automatically available in the script's namespace. You can access Snakemake's shell function using `snakemake.shell()` rather than importing subprocess.
🪛 Ruff (0.8.2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py

49-49: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


54-54: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


62-62: Undefined name snakemake

(F821)


67-67: Undefined name snakemake

(F821)


68-68: Undefined name snakemake

(F821)


69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Run-Test-Mode
🔇 Additional comments (6)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (6)

1-5: Good use of Polars for data processing

Using Polars is an excellent choice for handling dataframe operations, providing better performance and more modern API than pandas. The function signature clearly defines the required inputs and outputs.


6-26: LGTM: Well-structured column mapping

The column mapping dictionary is clearly defined and easy to maintain if column names change in the future.


30-34: Elegant approach for constructing output columns

The approach of concatenating lists to construct the final columns list is clean and maintainable.


41-48: Good filtering logic for mane_select and canonical transcripts

The implementation correctly addresses the PR objective by prioritizing transcripts based on mane_select and canonical flags, which improves reliability over sort-order dependency.


56-60: LGTM: Good data transformation pipeline

The data transformation pipeline is well-structured, performing column additions, joins, and null filling in a clear sequence.


62-62: Correct usage of Snakemake shell function

Based on the retrieved learning, this is the correct way to use Snakemake's shell function for executing shell commands in a Snakemake script.

🧰 Tools
🪛 Ruff (0.8.2)

62-62: Undefined name snakemake

(F821)

stolpeo
stolpeo previously approved these changes Mar 10, 2025
Copy link

@stolpeo stolpeo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)

50-56: Use Pythonic boolean comparisons

Current boolean comparisons use explicit equality checks with True/False literals, which is not idiomatic Python.

-    discard = df.filter(pl.col("mane_or_canonical") == False).select(
+    discard = df.filter(~pl.col("mane_or_canonical")).select(
     ["gene", "gene_id"]
    )
    print("No mane_select or canonical transcript found for ", discard.select("gene"))
    df = (
-        df.filter(pl.col("mane_or_canonical") == True)
+        df.filter(pl.col("mane_or_canonical"))
🧰 Tools
🪛 Ruff (0.8.2)

50-50: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


55-55: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a5fc1e3 and 9151214.

📒 Files selected for processing (1)
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
Learnt from: tedil
PR: varfish-org/varfish-db-downloader#101
File: rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py:62-62
Timestamp: 2025-03-10T08:16:40.589Z
Learning: When writing Python scripts for Snakemake (used with the `script:` directive), the `snakemake` object is automatically available in the script's namespace. You can access Snakemake's shell function using `snakemake.shell()` rather than importing subprocess.
🪛 Ruff (0.8.2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py

50-50: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


55-55: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


68-68: Undefined name snakemake

(F821)


69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)


71-71: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Run-Test-Mode
🔇 Additional comments (6)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (6)

1-3: Good use of imports and dependencies

You've correctly imported the Polars library for data manipulation and the shell function from snakemake.shell based on the previous feedback. This is the right approach for Snakemake scripts.


4-35: Well-structured data mapping and column definitions

The code defines a clear mapping between source and destination column names, as well as additional columns needed. The organization of this section makes the code readable and maintainable.


37-40: Effective data loading with Polars

The use of Polars for reading TSV files with appropriate parameters (separator and null_values) is a good choice. Polars provides better performance than pandas for these types of operations.


42-53: Clear data transformation pipeline

The data transformation pipeline is well-constructed with appropriate sorting, filtering, and grouping operations to ensure the correct transcript selection.

🧰 Tools
🪛 Ruff (0.8.2)

50-50: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


62-63: Proper MD5 checksum generation using Snakemake shell

You've properly used the Snakemake shell function for generating the MD5 checksum. This is the correct approach based on the previous feedback and retrieved learning that the snakemake object is automatically available in the script's namespace.


66-72: Proper use of Snakemake integration

The script correctly uses the snakemake object to access inputs and outputs when executed by Snakemake. While static analysis tools may flag "undefined name snakemake", this is expected behavior as the snakemake object is injected into the script's namespace by Snakemake itself.

🧰 Tools
🪛 Ruff (0.8.2)

68-68: Undefined name snakemake

(F821)


69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)


71-71: Undefined name snakemake

(F821)

@stolpeo
Copy link

stolpeo commented Mar 10, 2025

closes #103

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (8)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (8)

1-3: Consider adding a docstring to explain script's purpose.

The script lacks a module-level docstring that would help users understand its purpose, inputs, and expected outputs. Since this script is replacing inline functionality from a Snakemake file, clear documentation would be valuable.

 import polars as pl
 from snakemake.shell import shell
 
+"""
+Process gnomAD v4 constraint data prioritizing MANE select and canonical transcripts.
+
+This script converts gnomAD v4 constraint data to the VarFish TSV format, ensuring
+that MANE select and canonical transcripts are prioritized for each gene.
+"""
+

4-6: Add a function docstring with parameter descriptions.

Adding a proper docstring to the main function would improve code readability and maintenance.

 def main(
     constraint_tsv_path, ensembl_xlink_tsv_path, output_tsv_path, output_tsv_md5_path
 ):
+    """
+    Convert gnomAD v4 constraint data to VarFish TSV format.
+    
+    Args:
+        constraint_tsv_path: Path to the input gnomAD constraint TSV file
+        ensembl_xlink_tsv_path: Path to the Ensembl cross-reference TSV file
+        output_tsv_path: Path where the output TSV file will be written
+        output_tsv_md5_path: Path where the MD5 checksum of the output file will be written
+    """

42-49: Add a comment explaining the transcript selection logic.

The sorting and filtering logic is complex and crucial to understanding how the preferred transcript is selected for each gene. A comment would clarify the intent.

     df = (
+        # Sort to prioritize MANE select and canonical transcripts, then by CDS length
+        # Group by gene and take the last record (highest priority transcript per gene)
         df.sort(by=["gene", "mane_select", "canonical", "cds_length", "transcript"])
         .filter(pl.col("gene_id").str.starts_with("ENSG"))
         .group_by(["gene"])
         .last()
         .sort(["gene", "mane_select", "canonical"])
         .with_columns(mane_or_canonical=pl.col("canonical") | pl.col("mane_select"))
     )

50-53: Use is_not() for boolean column comparison.

Instead of comparing a boolean column to False, use the more idiomatic Polars expression with is_not().

-    discard = df.filter(pl.col("mane_or_canonical") == False).select(
+    discard = df.filter(pl.col("mane_or_canonical").is_not()).select(
         ["gene", "gene_id"]
     )
     print("No mane_select or canonical transcript found for ", discard.select("gene"))
🧰 Tools
🪛 Ruff (0.8.2)

50-50: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


53-53: Consider using logging instead of print.

Using Python's logging module would provide better control over output verbosity and destination, which is especially important in a Snakemake workflow.

+import logging
+
+# At the top of the file
+logging.basicConfig(level=logging.INFO)
+
+# Replace the print statement with
-    print("No mane_select or canonical transcript found for ", discard.select("gene"))
+    logging.info(f"No mane_select or canonical transcript found for {discard.select('gene')}")

55-55: Use is_() for boolean column comparison.

Instead of comparing a boolean column to True, use the more idiomatic Polars expression with is_().

     df = (
-        df.filter(pl.col("mane_or_canonical") == True)
+        df.filter(pl.col("mane_or_canonical").is_())
         .rename(columns_src_dst)
         .with_columns(pl.lit(None).alias(col) for col in additional_columns)
         .join(ensembl_xlink, on=["ensembl_transcript_id"])
         .select(columns_dst)
     )
🧰 Tools
🪛 Ruff (0.8.2)

55-55: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


57-57: Consider using list comprehension for clarity.

The current code uses generator expression inside with_columns(). For better readability, consider using a list comprehension.

-        .with_columns(pl.lit(None).alias(col) for col in additional_columns)
+        .with_columns([pl.lit(None).alias(col) for col in additional_columns])

1-72: Add error handling for robust execution.

The script lacks error handling which could be helpful for diagnosing issues in the Snakemake workflow. Consider adding try-except blocks around critical operations.

 def main(
     constraint_tsv_path, ensembl_xlink_tsv_path, output_tsv_path, output_tsv_md5_path
 ):
+    try:
         columns_src_dst = {
             # ... existing code ...
         }
         
         # ... existing code ...
         
         df.write_csv(output_tsv_path, separator="\t")
         shell(f"md5sum {output_tsv_path} > {output_tsv_md5_path}")
+    except Exception as e:
+        logging.error(f"Error processing gnomAD constraints: {e}")
+        raise
🧰 Tools
🪛 Ruff (0.8.2)

50-50: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


55-55: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


67-67: Undefined name snakemake

(F821)


68-68: Undefined name snakemake

(F821)


69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9151214 and 197787e.

📒 Files selected for processing (1)
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
Learnt from: tedil
PR: varfish-org/varfish-db-downloader#101
File: rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py:62-62
Timestamp: 2025-03-10T08:16:40.589Z
Learning: When writing Python scripts for Snakemake (used with the `script:` directive), the `snakemake` object is automatically available in the script's namespace. You can access Snakemake's shell function using `snakemake.shell()` rather than importing subprocess.
🪛 Ruff (0.8.2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py

50-50: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


55-55: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


67-67: Undefined name snakemake

(F821)


68-68: Undefined name snakemake

(F821)


69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Run-Test-Mode
🔇 Additional comments (2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (2)

62-62: Correct use of Snakemake shell function.

Great job using the shell function from snakemake.shell for generating the MD5 checksum. This is the recommended approach for Snakemake scripts.


65-71: Appropriate handling of Snakemake integration.

The script correctly accesses the snakemake object which is automatically available in the namespace when run through Snakemake's script: directive. This is the idiomatic way to write Snakemake scripts.

🧰 Tools
🪛 Ruff (0.8.2)

67-67: Undefined name snakemake

(F821)


68-68: Undefined name snakemake

(F821)


69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (5)

33-33: Consider adding a comment explaining the slice operation

The code skips the first value from columns_src_dst.values() (which is "ensembl_transcript_id"). This is intentional since it's included separately in the hardcoded part of the list, but this might not be immediately obvious to other developers.

-        + list(columns_src_dst.values())[1:]
+        # Skip first value "ensembl_transcript_id" as it's already included via the join operation
+        + list(columns_src_dst.values())[1:]

50-52: Avoid direct Boolean comparisons in Polars expressions

According to Polars best practices, it's better to avoid equality comparisons to boolean literals.

-    discard = df.filter(pl.col("mane_or_canonical") == False).select(
+    discard = df.filter(~pl.col("mane_or_canonical")).select(
🧰 Tools
🪛 Ruff (0.8.2)

50-50: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


53-53: Consider using proper logging instead of print

For better integration with Snakemake's logging system, consider using the Python logging module instead of print statements.

-    print("No mane_select or canonical transcript found for ", discard.select("gene"))
+    import logging
+    logging.info(f"No mane_select or canonical transcript found for {discard.select('gene')}")

55-55: Avoid direct Boolean comparisons in Polars expressions

Similar to the previous comment, replace the explicit boolean comparison with a more idiomatic expression.

-        df.filter(pl.col("mane_or_canonical") == True)
+        df.filter(pl.col("mane_or_canonical"))
🧰 Tools
🪛 Ruff (0.8.2)

55-55: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


67-73: Add a comment explaining the snakemake object

The snakemake object is automatically available when the script is run through Snakemake's script: directive, but this might not be obvious to all developers.

 if __name__ == "__main__":
+    # The snakemake object is automatically available when this script is run through
+    # Snakemake's script: directive
     main(
         snakemake.input.tsv,
         snakemake.input.xlink_ensembl,
         snakemake.output.tsv,
         snakemake.output.tsv_md5,
     )
🧰 Tools
🪛 Ruff (0.8.2)

69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)


71-71: Undefined name snakemake

(F821)


72-72: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 197787e and c5cd93f.

📒 Files selected for processing (1)
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
Learnt from: tedil
PR: varfish-org/varfish-db-downloader#101
File: rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py:62-62
Timestamp: 2025-03-10T08:16:40.589Z
Learning: When writing Python scripts for Snakemake (used with the `script:` directive), the `snakemake` object is automatically available in the script's namespace. You can access Snakemake's shell function using `snakemake.shell()` rather than importing subprocess.
🪛 Ruff (0.8.2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py

50-50: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


55-55: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


69-69: Undefined name snakemake

(F821)


70-70: Undefined name snakemake

(F821)


71-71: Undefined name snakemake

(F821)


72-72: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Run-Test-Mode
🔇 Additional comments (3)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (3)

1-2: Good fix: Properly importing shell from snakemake.shell

This correctly implements the suggestion from previous reviews to use the Snakemake shell function rather than relying on undefined shell functions or subprocess.


42-49: Good implementation: Reliable transcript selection logic

The sorting and filtering logic ensures priority selection of transcripts with "mane_select" and "canonical" attributes, which directly addresses the PR objective of making the selection more reliable rather than depending on CLI tool sort order.


37-40:

❓ Verification inconclusive

Verify ensembl_xlink_tsv schema matches expectations

The code joins data using the "ensembl_transcript_id" column, but it's worth verifying that this column exists in the ensembl_xlink dataset with the expected values.

Also applies to: 58-58


🏁 Script executed:

#!/bin/bash
# Check for the existence and format of the ensembl_transcript_id column in the xlink file
head -n 5 $(find . -name "*.tsv" | grep -i ensembl | grep -i xlink) | cut -f 1-5

Length of output: 84


Below is a new shell script designed to help verify the schema of the TSV file. Please run it and manually check that the header row contains the "ensembl_transcript_id" column with the expected format.

Action Required: Verify ensembl_xlink Data Schema

  • The script searches for TSV files whose names include both "ensembl" and "xlink".
  • For each file found, it prints the header line with each column on its own line.
  • After execution, please ensure that one of the header entries is exactly "ensembl_transcript_id" and that its values (from subsequent lines) match the expected format.

Once you’ve run the script, please verify that the "ensembl_transcript_id" column is present and correctly formatted.


🏁 Script executed:

#!/bin/bash
# Find all TSV files containing both 'ensembl' and 'xlink' in their filename and display header columns
files=$(find . -iname "*ensembl*xlink*.tsv")
if [ -z "$files" ]; then
  echo "No TSV files matching 'ensembl*xlink*.tsv' were found."
  exit 1
fi

for file in $files; do
  echo "File: $file"
  echo "Header columns (each on a new line):"
  head -n 1 "$file" | tr "\t" "\n"
  echo "----------------------"
done

Length of output: 197


Manual Verification of ensembl_xlink Dataset Required

The automated searches did not locate any TSV file matching the expected pattern, which means we couldn’t confirm the presence of the "ensembl_transcript_id" column automatically. Please ensure manually that the file referenced by the ensembl_xlink_tsv_path variable:

  • Exists in the repository: Verify that the expected file (with a name related to “ensembl” and “xlink”) is present and correctly referenced.
  • Has the correct schema: Open the file and confirm that its header includes a column exactly named "ensembl_transcript_id" and that the column values meet the expected format.

If the file is found under a different naming convention or location, please update the code accordingly. If it’s missing entirely, add it or adjust the logic that joins data on that column.

@stolpeo stolpeo removed the request for review from holtgrewe March 11, 2025 09:56
@stolpeo stolpeo enabled auto-merge (squash) March 11, 2025 09:58
@stolpeo stolpeo disabled auto-merge March 11, 2025 09:59
@stolpeo stolpeo force-pushed the fix-gnomad-constraints-v4 branch from 3a98e70 to f8509f4 Compare March 11, 2025 14:04
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)

66-67: Duplicate if __name__ == "__main__" check

There are two identical if __name__ == "__main__" checks at lines 66 and 67. The first one is followed by an empty line, which suggests it might be a mistake.

-if __name__ == "__main__":
-
 if __name__ == "__main__":
     main(
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a98e70 and f8509f4.

📒 Files selected for processing (11)
  • rules/work/genes/envs/polars.yaml (1 hunks)
  • rules/work/genes/gnomad.smk (1 hunks)
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1 hunks)
  • environment.yml (1 hunks)
  • rules/work/genes/envs/polars.yaml (0 hunks)
  • rules/work/genes/gnomad.smk (0 hunks)
  • rules/work/genes/gnomad.smk (0 hunks)
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1 hunks)
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (2 hunks)
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (0 hunks)
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • rules/work/genes/envs/polars.yaml
  • rules/work/genes/envs/polars.yaml
🚧 Files skipped from review as they are similar to previous changes (6)
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
  • environment.yml
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
  • rules/work/genes/gnomad.smk
  • rules/work/genes/gnomad.smk
  • rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
🧰 Additional context used
🧠 Learnings (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
Learnt from: tedil
PR: varfish-org/varfish-db-downloader#101
File: rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py:62-62
Timestamp: 2025-03-10T08:16:40.589Z
Learning: When writing Python scripts for Snakemake (used with the `script:` directive), the `snakemake` object is automatically available in the script's namespace. You can access Snakemake's shell function using `snakemake.shell()` rather than importing subprocess.
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Run-Test-Mode
  • GitHub Check: Check-Remote-URLs
🔇 Additional comments (3)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (3)

62-62: Great fix using snakemake.shell!

This change correctly addresses the previously identified issue with the undefined shell function. Using snakemake.shell() is the proper approach when writing Python scripts for Snakemake's script: directive, as the snakemake object is automatically available in the script's namespace.


42-56: Good implementation for selecting preferred transcripts

The sorting and filtering logic here implements the PR objective of ensuring genes_gnomad_convert_v4 properly selects mane_select and canonical transcripts. By prioritizing these transcript types explicitly rather than relying on CLI sort order, you've made the code more reliable and readable.

The diagnostic print statement for genes without preferred transcripts is also helpful for monitoring data quality.

🧰 Tools
🪛 Ruff (0.8.2)

50-50: Avoid equality comparisons to False; use if not pl.col("mane_or_canonical"): for false checks

Replace with not pl.col("mane_or_canonical")

(E712)


55-55: Avoid equality comparisons to True; use if pl.col("mane_or_canonical"): for truth checks

Replace with pl.col("mane_or_canonical")

(E712)


1-2: Appropriate imports for the implementation

The imports look good - using Polars for the data manipulation, which aligns with the PR description of reimplementing using the Polars library, and importing the Snakemake shell function that's used in line 62.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants